208 research outputs found
Improving average ranking precision in user searches for biomedical research datasets
The availability of research datasets is a keystone of health and life science
study reproducibility and of scientific progress. Given the heterogeneity and
complexity of these data, a main challenge for research data management systems
is to provide users with the best answers to their search queries. In the
context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a
novel ranking pipeline to improve the search of datasets used in biomedical
experiments. Our system comprises a query expansion model based on word
embeddings, a similarity measure algorithm that takes into consideration the
relevance of the query terms, and a dataset categorisation method that boosts
the rank of datasets matching query constraints. The system was evaluated using
a corpus of 800k datasets and 21 annotated user queries. It provides
competitive results compared with the other challenge participants: in the
official run, it achieved the highest infAP among the participants, +22.3%
above the median infAP of the participants' best submissions. Overall, it
ranks in the top 2 when an aggregated metric using the best official measures
per participant is considered. The query expansion method had a positive impact
on the system's performance, improving our baseline by up to +5.0% and +3.4% on
the infAP and infNDCG metrics, respectively. Our similarity measure algorithm
appears robust, showing smaller performance variations under different training
conditions than the Divergence From Randomness framework. Finally, the result
categorisation did not have a significant impact on the system's performance.
We believe that our solution could be used to enhance biomedical dataset
management systems; in particular, data-driven query expansion methods could be
an alternative to the complexity of biomedical terminologies.
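The embedding-based query expansion described above can be sketched minimally as a nearest-neighbour lookup in vector space. The toy 2-d vectors, the similarity threshold, and the function names below are illustrative assumptions, not the authors' actual parameters or implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def expand_query(query_terms, embeddings, k=2, threshold=0.7):
    """Append to the query the k nearest neighbours of each term
    whose cosine similarity exceeds the threshold."""
    expanded = list(query_terms)
    for term in query_terms:
        if term not in embeddings:
            continue
        scored = [
            (other, cosine(embeddings[term], vec))
            for other, vec in embeddings.items()
            if other != term and other not in expanded
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        expanded += [w for w, s in scored[:k] if s >= threshold]
    return expanded

# Toy 2-d embeddings (hypothetical values for illustration only).
emb = {
    "tumor":    [0.9, 0.1],
    "cancer":   [0.85, 0.2],
    "neoplasm": [0.8, 0.15],
    "weather":  [0.1, 0.9],
}
print(expand_query(["tumor"], emb))  # → ['tumor', 'neoplasm', 'cancer']
```

In practice the embeddings would come from a model trained on a biomedical corpus, and unrelated terms ("weather" here) are filtered out by the similarity threshold.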
Design of an Integrated Analytics Platform for Healthcare Assessment Centered on the Episode of Care
Assessing care quality and performance is essential to improving healthcare
processes and population health management. However, due to poor system design
and lack of access to the required data, this assessment is often delayed or
not done at all. The goal of our research is to investigate an advanced
analytics platform that enables healthcare quality and performance assessment.
We used a user-centered design approach to identify the system requirements and
adopted the concept of the episode of care as the building block of information
for a key performance indicator analytics system. We implemented architecture
and interface prototypes and performed a usability test with hospital users in
managerial roles. The results show that, by using user-centered design, we
created an analytics platform that provides a holistic and integrated view of
the clinical, financial, and operational aspects of the institution. Our
encouraging results warrant further studies to understand other aspects of
usability.
DS4DH at #SMM4H 2023: Zero-Shot Adverse Drug Events Normalization using Sentence Transformers and Reciprocal-Rank Fusion
This paper describes the performance evaluation of a system for adverse drug
event normalization, developed by the Data Science for Digital Health group for
the Social Media Mining for Health Applications 2023 shared task 5. Shared task
5 targeted the normalization of adverse drug event mentions on Twitter to
standard concepts from the Medical Dictionary for Regulatory Activities
terminology. Our system hinges on a two-stage approach: BERT fine-tuning for
entity recognition, followed by zero-shot normalization using sentence
transformers and reciprocal-rank fusion. The approach yielded a precision of
44.9%, a recall of 40.5%, and an F1-score of 42.6%. It outperformed the median
performance in shared task 5 by 10% and demonstrated the highest performance
among all participants. These results substantiate the effectiveness of our
approach and its potential application for adverse drug event normalization in
the realm of social media text mining.
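Reciprocal-rank fusion, used in the second stage above, combines several ranked candidate lists by summing 1/(k + rank) per candidate. A minimal sketch follows; the concept identifiers and the two candidate runs are made-up examples, and k=60 is the commonly used default rather than a value stated in the abstract:

```python
def rrf(rankings, k=60):
    """Reciprocal-rank fusion: merge several ranked lists (best first)
    by summing 1/(k + rank) for each candidate across all lists."""
    scores = {}
    for ranking in rankings:
        for rank, cand in enumerate(ranking, start=1):
            scores[cand] = scores.get(cand, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two hypothetical candidate lists for one ADE mention, best first,
# e.g. from two different sentence-transformer retrievers.
run_a = ["C001", "C002", "C003"]
run_b = ["C002", "C004", "C001"]
print(rrf([run_a, run_b]))  # → ['C002', 'C001', 'C004', 'C003']
```

C002 wins because it ranks highly in both lists, which is the point of the fusion: agreement between retrievers outweighs a single top rank.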
Modeling and prototyping a metadata management interface for libraries
The conceptual model of bibliographic metadata most widely used in libraries today dates back to the creation of computerized cataloguing in the 1960s. It was designed to describe physical resources and no longer matches the needs and goals of today's public catalogue interfaces and platforms. Two competing models are currently being adopted in libraries: Bibframe and LRM. These models, better suited to the linked-data web ecosystem, facilitate metadata exchange and aim to open up the resources held in library catalogues. Both propose, with varying degrees of granularity, conceptual structures of three or four descriptive levels, and build on newer standards from the web and from library science, such as RDF and RDA. This work evaluates, through exploratory qualitative research, how well these two models and the current model fit the emerging needs of cataloguing librarians in the four types of institutions targeted by the Réseau des bibliothèques de Suisse occidentale (RERO). It also develops, on the basis of several use cases, process models to support the prototyping of the new cataloguing-module interface of the integrated library system (SIGB) under development at RERO. To this end, we conducted semi-structured interviews and cataloguing observations with six librarians specialized in bibliographic metadata, a multi-criteria comparative analysis, the creation of use cases, and UML modeling of the cataloguing processes. The outcome of this work consists of two distinct parts. The first is the qualitative analysis derived from the interviews.
This analysis highlighted a need, shared by almost all the librarians interviewed, for flexibility in cataloguing rules, as well as a fear that cataloguing would become more complex with the adoption of new standards. It did not, however, allow us to justify the choice of a specific conceptual model. For this reason, the second part of our results, the prototype of the cataloguing interface, is based on a "generic" model with three descriptive levels inspired by Bibframe. It is now important for RERO to move to a conceptual model that both showcases its bibliographic metadata and makes the information held in the catalogues of its member libraries more visible in the linked-data web ecosystem, without making the cataloguing processes more cumbersome.
Named entity recognition in chemical patents using ensemble of contextual language models
Chemical patent documents describe a broad range of applications and hold key
reaction and compound information, such as chemical structures, reaction
formulas, and molecular properties. These informational entities must first be
identified in text passages before they can be used in downstream tasks. Text
mining provides the means to extract relevant information from chemical patents
through information extraction techniques. In this work, carried out as part of
the Information Extraction task of the Cheminformatics Elsevier Melbourne
University challenge, we study the effectiveness of contextualized language
models for extracting reaction information from chemical patents. We assess
transformer architectures trained on generic and specialised corpora to propose
a new ensemble model. Our best model, based on a majority ensemble approach,
achieves an exact F1-score of 92.30% and a relaxed F1-score of 96.24%. The
results show that an ensemble of contextualized language models provides an
effective method for extracting information from chemical patents.
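A majority ensemble over token-level NER predictions can be sketched as a per-token vote across aligned label sequences. The BIO labels, the three model outputs, and the function name below are illustrative assumptions, not the actual models or label set used in the challenge:

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse per-token label predictions from several NER models.
    predictions: list of label sequences, one per model, all aligned
    to the same tokens. Returns the most frequent label per token."""
    fused = []
    for labels in zip(*predictions):
        most_common = Counter(labels).most_common(1)[0][0]
        fused.append(most_common)
    return fused

# Three hypothetical model outputs for the same five tokens (BIO scheme).
model_a = ["O", "B-REACTION", "I-REACTION", "O", "B-COMPOUND"]
model_b = ["O", "B-REACTION", "O",          "O", "B-COMPOUND"]
model_c = ["O", "B-REACTION", "I-REACTION", "O", "O"]
print(majority_vote([model_a, model_b, model_c]))
# → ['O', 'B-REACTION', 'I-REACTION', 'O', 'B-COMPOUND']
```

Each disagreement (tokens 3 and 5 here) is resolved by the two-out-of-three majority, which is why an odd number of ensemble members is convenient.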
Detection of Patients at Risk of Multidrug-Resistant Enterobacteriaceae Infection Using Graph Neural Networks: A Retrospective Study
Funding: This research was funded by the Joint Swiss–Portuguese Academic Program from the University of Applied Sciences and Arts Western Switzerland (HES-SO) and the Fundação para a Ciência e Tecnologia (FCT). S.G.P. also acknowledges FCT for her direct funding (CEECINST/00051/2018) and her research
unit (UIDB/05704/2020). Funders were not involved in the study design, data pre-processing, data analysis, interpretation, or report writing.
Author contributions: R.G. and A.B. designed and implemented the models, and ran the experiments and analyses. R.G. and D.T. wrote the manuscript draft. D.T. and S.G.P. conceptualized the experiments and acquired funding. R.G., D.P., and S.G.P. curated the data. R.G., A.B., D.P., and D.T. analyzed the
data. All authors reviewed and approved the manuscript. Competing interests: The authors declare that they have no competing interests.
Background: While Enterobacteriaceae bacteria are commonly found in the healthy human gut, their colonization of other body parts can potentially evolve into serious infections and health threats. We investigate a graph-based machine learning model to predict risks of inpatient colonization by multidrug-resistant (MDR) Enterobacteriaceae. Methods: Colonization prediction was defined as a binary task, where the goal is to predict whether a patient is colonized by MDR Enterobacteriaceae in an undesirable body part during their hospital stay. To capture topological features, interactions among patients and healthcare workers were modeled using a graph structure, where patients are described by nodes and their interactions are described by edges. Then, a graph neural network (GNN) model was trained to learn colonization patterns from the patient network enriched with clinical and spatiotemporal features. Results: The GNN model achieves performance between 0.91 and 0.96 area under the receiver operating characteristic curve (AUROC) when trained in inductive and
transductive settings, respectively, up to 8% above a logistic regression baseline (0.88). Comparing
network topologies, the configuration considering ward-related edges (0.91 inductive, 0.96 transductive)
outperforms the configurations considering caregiver-related edges (0.88, 0.89) and both types of
edges (0.90, 0.94). For the top 3 most prevalent MDR Enterobacteriaceae, the AUROC varies from 0.94
for Citrobacter freundii up to 0.98 for Enterobacter cloacae using the best-performing GNN model.
Conclusion: Topological features via graph modeling improve the performance of machine learning
models for Enterobacteriaceae colonization prediction. GNNs could be used to support infection
prevention and control programs to detect patients at risk of colonization by MDR Enterobacteriaceae
and other bacterial families.
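The ward-related edges that give the best-performing configuration above can be derived from co-presence: two patients are linked when their stays in the same ward overlap in time. A minimal sketch of that edge construction follows; the tuple layout, day-based intervals, and example stays are illustrative assumptions, not the study's actual data schema:

```python
def ward_edges(stays):
    """Build patient-patient edges from ward co-presence.
    stays: list of (patient_id, ward, start_day, end_day) tuples.
    Two patients are connected if their stays in the same ward
    overlap in time (inclusive interval overlap)."""
    edges = set()
    for i, (p1, w1, s1, e1) in enumerate(stays):
        for p2, w2, s2, e2 in stays[i + 1:]:
            if p1 != p2 and w1 == w2 and s1 <= e2 and s2 <= e1:
                edges.add((min(p1, p2), max(p1, p2)))
    return sorted(edges)

# Hypothetical stays: (patient, ward, start_day, end_day).
stays = [
    ("A", "ICU", 1, 5),
    ("B", "ICU", 4, 8),   # overlaps A in the ICU -> edge A-B
    ("C", "ICU", 9, 12),  # same ward, no temporal overlap
    ("D", "ER",  1, 3),   # different ward
]
print(ward_edges(stays))  # → [('A', 'B')]
```

The resulting edge list, plus per-node clinical and spatiotemporal features, is what a GNN would then consume to learn colonization patterns.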
Extraction of biomedical concepts from clinical trials using natural language processing
Clinical trials are scientific studies that evaluate the efficacy of certain medications, drugs, or new medical methods, as well as their side effects. Most of the time, they end in failure, so having a tool to assess the risk of failure is crucial. Clinical trials are written in free text, which makes standard automatic processing by computer nearly impossible; this is why natural language processing is used. The goal of this work is to create a database containing the clinical trials and the concepts that can be extracted from them, to enable automatic processing in the future.
Text mining processing pipeline for semi structured data D3.3
Unstructured and semi-structured cohort data contain relevant information about the health condition of a patient, e.g., free text describing disease diagnoses, drugs, and medication reasons, which is often not available in structured formats. One of the challenges posed by medical free text is that there can be several ways of mentioning the same concept. Therefore, encoding free text into unambiguous descriptors allows us to leverage the value of the cohort data, in particular by facilitating its findability and interoperability across cohorts in the project.
Named entity recognition and normalization enable the automatic conversion of free text into standard medical concepts. Given the volume of data shared in the CINECA project, the WP3 text mining working group has developed named entity normalization techniques to obtain standard concepts from the unstructured and semi-structured fields available in the cohorts. In this deliverable, we present the methodology used to develop the different text mining tools created by the dedicated SFU, UMCG, EBI, and HES-SO/SIB groups for specific CINECA cohorts.
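The normalization step described above, mapping variant mentions of a concept to one standard descriptor, can be sketched as a case-insensitive synonym lookup. The mini-terminology, the ICD-10-style codes, and the function name are hypothetical examples for illustration, not the project's actual vocabularies or tools:

```python
def normalize_mentions(mentions, synonym_map):
    """Map free-text mentions to standard concept codes using a
    case-insensitive synonym dictionary; unmatched mentions map to None."""
    lookup = {
        syn.lower(): code
        for code, synonyms in synonym_map.items()
        for syn in synonyms
    }
    return {mention: lookup.get(mention.lower()) for mention in mentions}

# Hypothetical mini-terminology: concept code -> known synonyms.
terminology = {
    "ICD10:E11": ["type 2 diabetes", "diabetes mellitus type 2", "T2DM"],
    "ICD10:I10": ["hypertension", "high blood pressure"],
}
mentions = ["T2DM", "High blood pressure", "migraine"]
print(normalize_mentions(mentions, terminology))
# → {'T2DM': 'ICD10:E11', 'High blood pressure': 'ICD10:I10', 'migraine': None}
```

Real systems replace the exact lookup with fuzzy or embedding-based matching, but the input/output contract, free-text mention in, standard concept out, stays the same.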